Building the Customer and Product Modelling

1 Load Data

We first want to load our datasets and prepare them for some simple association rules mining.

tnx_data_tbl <- read_rds("data/retail_data_cleaned_tbl.rds")

tnx_data_tbl %>% glimpse()
## Rows: 1,021,424
## Columns: 23
## $ row_id            <chr> "ROW0000001", "ROW0000002", "ROW0000003", "ROW000000…
## $ excel_sheet       <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010"…
## $ invoice_id        <chr> "489434", "489434", "489434", "489434", "489434", "4…
## $ stock_code        <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ description       <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY …
## $ quantity          <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, …
## $ invoice_date      <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 200…
## $ price             <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55…
## $ customer_id       <chr> "13085", "13085", "13085", "13085", "13085", "13085"…
## $ country           <chr> "United Kingdom", "United Kingdom", "United Kingdom"…
## $ stock_code_upr    <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ cancellation      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ invoice_dttm      <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-0…
## $ invoice_month     <chr> "December", "December", "December", "December", "Dec…
## $ invoice_dow       <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday…
## $ invoice_dom       <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01"…
## $ invoice_hour      <chr> "07", "07", "07", "07", "07", "07", "07", "07", "07"…
## $ invoice_minute    <chr> "45", "45", "45", "45", "45", "45", "45", "45", "45"…
## $ invoice_woy       <chr> "49", "49", "49", "49", "49", "49", "49", "49", "49"…
## $ invoice_ym        <chr> "200912", "200912", "200912", "200912", "200912", "2…
## $ stock_value       <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59…
## $ invoice_monthprop <dbl> 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.04…
## $ exclude           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…

To use our rules mining we just need the invoice data and the stock code, so we can ignore the rest. Also, we ignore the issue of returns and just look at purchases.

tnx_purchase_tbl <- tnx_data_tbl %>%
  filter(
    quantity > 0,
    price > 0,
    exclude == FALSE
    ) %>%
  select(
    invoice_id, invoice_date, stock_code, customer_id, quantity, price,
    stock_value, description
    )

tnx_purchase_tbl %>% glimpse()
## Rows: 992,072
## Columns: 8
## $ invoice_id   <chr> "489434", "489434", "489434", "489434", "489434", "489434…
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-…
## $ stock_code   <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", "…
## $ customer_id  <chr> "13085", "13085", "13085", "13085", "13085", "13085", "13…
## $ quantity     <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 18, 3…
## $ price        <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55, 3.7…
## $ stock_value  <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59.50, …
## $ description  <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LIGHT…

We also want to load the free-text description of the various stock items as this will help will interpretation of the various rules we have found.

product_data_tbl <- read_rds("data/stock_code_lookup_tbl.rds")

product_data_tbl %>% glimpse()
## Rows: 4,733
## Columns: 2
## $ stock_code_upr <chr> "10002", "10002R", "10080", "10109", "10120", "10123C",…
## $ desc           <chr> "INFLATABLE POLITICAL GLOBE", "ROBOT PENCIL SHARPNER", …

Finally, we set a date for our dataset before which we wish to train our data and use the remainder as our model validation.

training_data_date <- as.Date("2011-03-31")

2 Build Association Rules Model

We now build our association rules based on the lower support data.

The idea is to repeat some of the initial association rules analysis: we use the APRIORI algorithm to mine the rules, and then convert the discovered rules to produce a graph of the products and the rules.

With this graph, we then use the disjoint components of this graph to cluster the products, and take the largest subgraph and cluster that one according to some standard clustering.

2.1 Load Transaction Data

To build our rules, we first need to load the transactions in the format required for the arules package.

tnx_purchase_tbl %>%
  filter(invoice_date <= training_data_date) %>%
  select(invoice_id, stock_code) %>%
  write_csv("data/tnx_arules_input.csv")

basket_tnxdata <- read.transactions(
    file   = "data/tnx_arules_input.csv",
    format = "single",
    sep    = ",",
    header = TRUE,
    cols   = c("invoice_id", "stock_code")
    )

basket_tnxdata %>% glimpse()
## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   ..@ itemInfo   :'data.frame':  4315 obs. of  1 variable:
##   .. ..$ labels: chr [1:4315] "10002" "10002R" "10080" "10109" ...
##   ..@ itemsetInfo:'data.frame':  24950 obs. of  1 variable:
##   .. ..$ transactionID: chr [1:24950] "489434" "489435" "489436" "489437" ...

2.2 Construct Association Rules

Having loaded the individual transaction data we now construct our basket data and use the APRIORI algorithm to discover our rules.

basket_arules <- apriori(
    basket_tnxdata,
    parameter = list(supp = 0.005, conf = 0.1)
  )
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.1    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 124 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4315 item(s), 24950 transaction(s)] done [0.18s].
## sorting and recoding items ... [1376 item(s)] done [0.01s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 done [0.14s].
## writing ... [14279 rule(s)] done [0.01s].
## creating S4 object  ... done [0.01s].
basket_arules_tbl <- basket_arules %>%
  as("data.frame") %>%
  as_tibble() %>%
  arrange(desc(lift))

basket_arules_tbl %>% glimpse()
## Rows: 14,279
## Columns: 6
## $ rules      <chr> "{22459} => {22458}", "{22458} => {22459}", "{21801,21802} …
## $ support    <dbl> 0.005891784, 0.005891784, 0.005611222, 0.005611222, 0.00561…
## $ confidence <dbl> 0.8647059, 0.8546512, 0.8860759, 0.8695652, 0.8860759, 0.77…
## $ coverage   <dbl> 0.006813627, 0.006893788, 0.006332665, 0.006452906, 0.00633…
## $ lift       <dbl> 125.43263, 125.43263, 113.95667, 106.87513, 103.79153, 100.…
## $ count      <int> 147, 147, 140, 140, 140, 158, 158, 161, 161, 192, 158, 158,…

Having constructed the main association rules, we then convert the discovered rules into a graph.

apriori_rules_igraph <- basket_arules %>%
  plot(
    measure = "support",
    method  = "graph",
    control = list(max = 20000)
    ) %>%
  as("igraph")

apriori_rules_igraph %>% summary()
## IGRAPH 23e8c5d DN-- 14900 36743 -- 
## + attr: name (v/c), label (v/c), support (v/n), confidence (v/n),
## | coverage (v/n), lift (v/n), count (v/n), order (v/n)

Having constructed the graph, we now want to visualise it.

basket_arules %>%
  head(n = 500, by = "support") %>%
  plot(
    measure  = "lift",
    method   = "graph",
    engine   = "htmlwidget"
    )

2.3 Determine Graph Clusters

With the constructed graph we now want to label the elements that are part of the disjoint components of the graph.

apriori_rules_tblgraph <- apriori_rules_igraph %>%
  as_tbl_graph() %>%
  mutate(
    component_id = group_components()
    ) %>%
  group_by(component_id) %>%
  mutate(
    component_size = n()
    ) %>%
  ungroup()

apriori_rules_tblgraph %>% glimpse()
## List of 10
##  $ :List of 1
##   ..$ 18: 'igraph.vs' Named int [1:19] 5089 5091 5093 5095 5097 5099 5101 5103 5105 5106 ...
##   .. ..- attr(*, "names")= chr [1:19] "assoc4468" "assoc4470" "assoc4472" "assoc4474" ...
##   .. ..- attr(*, "env")=<weakref> 
##   .. ..- attr(*, "graph")= chr "23e8c5d1-44d0-4df9-b46b-57c0136ce010"
##  $ :List of 1
##   ..$ 25: 'igraph.vs' Named int [1:8] 647 952 994 1246 1248 1250 8447 8449
##   .. ..- attr(*, "names")= chr [1:8] "assoc26" "assoc331" "assoc373" "assoc625" ...
##   .. ..- attr(*, "env")=<weakref> 
##   .. ..- attr(*, "graph")= chr "23e8c5d1-44d0-4df9-b46b-57c0136ce010"
##  $ :List of 1
##   ..$ 27: 'igraph.vs' Named int [1:11] 649 954 996 1247 2206 2208 2209 2210 2211 8448 ...
##   .. ..- attr(*, "names")= chr [1:11] "assoc28" "assoc333" "assoc375" "assoc626" ...
##   .. ..- attr(*, "env")=<weakref> 
##   .. ..- attr(*, "graph")= chr "23e8c5d1-44d0-4df9-b46b-57c0136ce010"
##  $ :List of 1
##   ..$ 29: 'igraph.vs' Named int [1:2] 646 648
##   .. ..- attr(*, "names")= chr [1:2] "assoc25" "assoc27"
##   .. ..- attr(*, "env")=<weakref> 
##   .. ..- attr(*, "graph")= chr "23e8c5d1-44d0-4df9-b46b-57c0136ce010"
##  $ :List of 1
##   ..$ 35: 'igraph.vs' Named int [1:3] 2859 2860 2861
##   .. ..- attr(*, "names")= chr [1:3] "assoc2238" "assoc2239" "assoc2240"
##   .. ..- attr(*, "env")=<weakref> 
##   .. ..- attr(*, "graph")= chr "23e8c5d1-44d0-4df9-b46b-57c0136ce010"
##  $ :List of 1
##   ..$ 99: 'igraph.vs' Named int [1:3] 3713 3714 3715
##   .. ..- attr(*, "names")= chr [1:3] "assoc3092" "assoc3093" "assoc3094"
##   .. ..- attr(*, "env")=<weakref> 
##   .. ..- attr(*, "graph")= chr "23e8c5d1-44d0-4df9-b46b-57c0136ce010"
##  $ :List of 1
##   ..$ 216: 'igraph.vs' Named int [1:21] 685 1040 1042 1044 1046 1048 1050 1052 8507 8508 ...
##   .. ..- attr(*, "names")= chr [1:21] "assoc64" "assoc419" "assoc421" "assoc423" ...
##   .. ..- attr(*, "env")=<weakref> 
##   .. ..- attr(*, "graph")= chr "23e8c5d1-44d0-4df9-b46b-57c0136ce010"
##  $ :List of 1
##   ..$ 217: 'igraph.vs' Named int [1:137] 777 967 1043 1205 1581 1583 1585 1587 1589 1591 ...
##   .. ..- attr(*, "names")= chr [1:137] "assoc156" "assoc346" "assoc422" "assoc584" ...
##   .. ..- attr(*, "env")=<weakref> 
##   .. ..- attr(*, "graph")= chr "23e8c5d1-44d0-4df9-b46b-57c0136ce010"
##  $ :List of 1
##   ..$ 218: 'igraph.vs' Named int [1:120] 787 977 1053 1217 1594 1806 2044 2092 2274 2401 ...
##   .. ..- attr(*, "names")= chr [1:120] "assoc166" "assoc356" "assoc432" "assoc596" ...
##   .. ..- attr(*, "env")=<weakref> 
##   .. ..- attr(*, "graph")= chr "23e8c5d1-44d0-4df9-b46b-57c0136ce010"
##  $ :List of 1
##   ..$ 219: 'igraph.vs' Named int [1:130] 783 973 1049 1213 1590 2040 2088 2270 2397 2836 ...
##   .. ..- attr(*, "names")= chr [1:130] "assoc162" "assoc352" "assoc428" "assoc592" ...
##   .. ..- attr(*, "env")=<weakref> 
##   .. ..- attr(*, "graph")= chr "23e8c5d1-44d0-4df9-b46b-57c0136ce010"
##  - attr(*, "class")= chr [1:2] "tbl_graph" "igraph"
##  - attr(*, "active")= chr "nodes"

From the graph, we extract the nodes that correspond to the products (as opposed to the nodes corresponding to the mined association rules). These are identified as the various numeric values attached to the rules are blank.

We also wish to add an additional column that is the size of the group, so it is easier to identify outsized subgraphs suitable for further partitioning.

product_cluster_disjoint_tbl <- apriori_rules_tblgraph %>%
  activate(nodes) %>%
  as_tibble() %>%
  filter(are_na(support)) %>%
  group_by(component_id) %>%
  mutate(
    cluster_size = n()
    ) %>%
  ungroup() %>%
  arrange(desc(cluster_size), label) %>%
  group_by(component_id) %>%
  mutate(
    product_group_id = sprintf("DISJOINT_%03d", cur_group_id()),
    cluster_size,
    stock_code       = label
    ) %>%
  ungroup() %>%
  select(product_group_id, cluster_size, stock_code) %>%
  arrange(product_group_id, stock_code)

product_cluster_disjoint_tbl %>% glimpse()
## Rows: 621
## Columns: 3
## $ product_group_id <chr> "DISJOINT_001", "DISJOINT_001", "DISJOINT_001", "DISJ…
## $ cluster_size     <int> 520, 520, 520, 520, 520, 520, 520, 520, 520, 520, 520…
## $ stock_code       <chr> "15036", "15056BL", "15056N", "15056P", "15060B", "16…

We now segment up the largest disjoint subgraph using alternative clustering techniques.

We try a few different types - inspecting the output of the various algorithms to see which clustering may be the

run_subgraph_clusters <- function(graph_cluster_func, rules_tblgraph, ...) {
  subgraph_clusters_tbl <- rules_tblgraph %>%
    to_subgraph(component_size == max(component_size)) %>%
    use_series(subgraph) %>%
    morph(to_undirected) %>%
    mutate(
      sub_id = graph_cluster_func(...)
      ) %>%
    unmorph() %>%
    activate(nodes) %>%
    as_tibble() %>%
    filter(are_na(support)) %>%
    count(sub_id, name = "cluster_size", sort = TRUE) %>%
    mutate(
      sub_id = factor(1:n(), levels = 1:n())
    )
  
  return(subgraph_clusters_tbl)
}

cluster_func <- c(
    "group_fast_greedy",
    "group_infomap",
    "group_label_prop",
    "group_spinglass"
    )

cluster_data_tbl <- tibble(cluster_func_name = cluster_func) %>%
  mutate(
    cluster_func = map(cluster_func_name, get),
    clustered    = map(cluster_func, run_subgraph_clusters,
                       rules_tblgraph = apriori_rules_tblgraph)
    ) %>%
  select(cluster_func_name, clustered) %>%
  unnest(clustered)

cluster_data_tbl %>% glimpse()
## Rows: 478
## Columns: 3
## $ cluster_func_name <chr> "group_fast_greedy", "group_fast_greedy", "group_fas…
## $ sub_id            <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ cluster_size      <int> 96, 46, 44, 43, 42, 35, 31, 25, 22, 12, 11, 11, 10, …

Having split this largest component into various splits, we now visualise the count and size of each cluster and use this to determine which clustering splits the data into a smaller number of larger clusters.

ggplot(cluster_data_tbl) +
  geom_col(aes(x = sub_id, y = cluster_size)) +
  geom_hline(aes(yintercept = 5), colour = "red") +
  facet_wrap(vars(cluster_func_name), scales = "free") +
  labs(
    x = "ID",
    y = "Cluster Size"
    ) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size = 8))

From this, it appears that fast_greedy is the method of choice.

Thus, we re-run the clustering for this larger component

subgraph_groups_tbl <- apriori_rules_tblgraph %>%
  to_subgraph(component_size == max(component_size)) %>%
  use_series(subgraph) %>%
  morph(to_undirected) %>%
  mutate(
    sub_id = group_fast_greedy()
    ) %>%
  unmorph() %>%
  activate(nodes) %>%
  as_tibble() %>%
  filter(are_na(support)) %>%
  group_by(sub_id) %>%
  mutate(
    cluster_size = n()
    ) %>%
  ungroup() %>%
  arrange(desc(cluster_size), label) %>%
  group_by(sub_id) %>%
  mutate(
    product_group_id = sprintf("LARGE_%03d", cur_group_id()),
    cluster_size,
    stock_code       = label
    ) %>%
  ungroup() %>%
  select(product_group_id, cluster_size, stock_code) %>%
  arrange(product_group_id, stock_code)
  

subgraph_groups_tbl %>% glimpse()
## Rows: 520
## Columns: 3
## $ product_group_id <chr> "LARGE_001", "LARGE_001", "LARGE_001", "LARGE_001", "…
## $ cluster_size     <int> 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 1…
## $ stock_code       <chr> "20719", "20723", "20724", "20961", "20963", "20966",…

We now combine both these lists of groupings and combine them.

product_cluster_tbl <- list(
    product_cluster_disjoint_tbl,
    subgraph_groups_tbl
    ) %>%
  bind_rows() %>%
  filter(product_group_id != "DISJOINT_001")

product_cluster_tbl %>% glimpse()
## Rows: 621
## Columns: 3
## $ product_group_id <chr> "DISJOINT_002", "DISJOINT_002", "DISJOINT_002", "DISJ…
## $ cluster_size     <int> 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ stock_code       <chr> "22520", "22521", "22522", "22523", "22524", "22525",…

2.4 Assign Products to Groups

We now want to look at our complete list of products and then assign them to each of our product groups. In terms of coverage, we need to check to see if all the products appearing in the most invoices.

We also want to look at the most commonly purchased items (in terms of appearance in baskets as opposed to quantity sold).

product_popular_tbl <- tnx_purchase_tbl %>%
  mutate(
    stock_code = str_to_upper(stock_code)
    ) %>%
  count(stock_code, name = "invoice_count", sort = TRUE)

product_popular_tbl %>% glimpse()
## Rows: 4,729
## Columns: 2
## $ stock_code    <chr> "85123A", "85099B", "22423", "21212", "20725", "84879", …
## $ invoice_count <int> 5464, 4014, 3916, 3118, 3056, 2807, 2674, 2434, 2398, 23…

We now combine this data to construct a product dataset containing the relevant summary data about each product.

product_data_full_tbl <- product_data_tbl %>%
  rename(stock_code = stock_code_upr) %>%
  left_join(product_cluster_tbl, by = "stock_code") %>%
  left_join(product_popular_tbl, by = "stock_code") %>%
  replace_na(
    list(product_group_id = "none", cluster_size = "0")
    ) %>%
  arrange(desc(invoice_count)) %>%
  mutate(ranking = 1:n())

product_data_full_tbl %>% glimpse()
## Rows: 4,733
## Columns: 6
## $ stock_code       <chr> "85123A", "85099B", "22423", "21212", "20725", "84879…
## $ desc             <chr> "? : CREAM HANGING HEART T-LIGHT HOLDER : WHITE HANGI…
## $ product_group_id <chr> "LARGE_004", "LARGE_005", "LARGE_011", "LARGE_002", "…
## $ cluster_size     <chr> "107", "19", "31", "66", "30", "107", "17", "33", "30…
## $ invoice_count    <int> 5464, 4014, 3916, 3118, 3056, 2807, 2674, 2434, 2398,…
## $ ranking          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…

First, let us export the table to help us inspect the data.

product_data_full_tbl %>% datatable()

To make it more obvious, we look at the products unassigned to a group and see how they rank in terms of invoice count.

product_data_full_tbl %>% filter(product_group_id == "none") %>% datatable()

3 Construct RFM Customer Segments

We now wish to repeat our RFM analysis, and then we reassign the customer base to each of these groupings.

segment_names <- c(
  "Champions", "Loyal Customers", "Potential Loyalist", "New Customers",
  "Promising", "Need Attention", "About To Sleep", "At Risk",
  "Can't Lose Them", "Lost"
  )

recency_lower   <- c(4, 2, 3, 4, 3, 2, 2, 1, 1, 1)
recency_upper   <- c(5, 5, 5, 5, 4, 3, 3, 2, 1, 2)
frequency_lower <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
frequency_upper <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
monetary_lower  <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
monetary_upper  <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)

segment_defs_tbl <- tibble(
  segment_names,
  recency_lower,
  recency_upper,
  frequency_lower,
  frequency_upper,
  monetary_lower,
  monetary_upper
  )

segment_defs_tbl %>% glimpse()
## Rows: 10
## Columns: 7
## $ segment_names   <chr> "Champions", "Loyal Customers", "Potential Loyalist", …
## $ recency_lower   <dbl> 4, 2, 3, 4, 3, 2, 2, 1, 1, 1
## $ recency_upper   <dbl> 5, 5, 5, 5, 4, 3, 3, 2, 1, 2
## $ frequency_lower <dbl> 4, 3, 1, 1, 1, 2, 1, 2, 4, 1
## $ frequency_upper <dbl> 5, 5, 3, 1, 1, 3, 2, 5, 5, 2
## $ monetary_lower  <dbl> 4, 3, 1, 1, 1, 2, 1, 2, 4, 1
## $ monetary_upper  <dbl> 5, 5, 3, 1, 1, 3, 2, 5, 5, 2

We first visually inspect these segment definitions and the bands.

segments_show_tbl <- segment_defs_tbl %>%
  mutate(
    recency   = glue("{recency_lower}-{recency_upper}")     %>% as.character(),
    frequency = glue("{frequency_lower}-{frequency_upper}") %>% as.character(),
    monetary  = glue("{monetary_lower}-{monetary_upper}")   %>% as.character()
    ) %>%
  select(
    segment_names, recency, frequency, monetary
    )

segments_show_tbl %>%
  datatable(
    colnames = c("Segment", "R", "F", "M"),
    options = list(
      columnDefs = list(list(className = 'dt-left', targets = 0:4))
      )
    )

We now construct the RFM data from the purchase data and assign each of the customers to a segment based on their RFM score.

There is a reasonable number of transactions with a missing customer_id, so we exclude this from the analysis.

customer_rfmdata <- tnx_purchase_tbl %>%
  filter(
    !are_na(customer_id),
    invoice_date <= training_data_date
    ) %>%
  group_by(invoice_date, customer_id) %>%
  summarise(
    .groups = "drop",
    
    total_spend = sum(stock_value)
    ) %>%
  rfm_table_order(
    customer_id   = customer_id,
    order_date    = invoice_date,
    revenue       = total_spend,
    analysis_date = training_data_date
    )

customer_rfmdata %>% print()
## # A tibble: 4,701 x 9
##    customer_id date_most_recent recency_days transaction_count amount
##    <chr>       <date>                  <dbl>             <dbl>  <dbl>
##  1 12346       2011-01-18                 72                 8 77556.
##  2 12347       2011-01-26                 64                 3  1799.
##  3 12348       2011-01-25                 65                 3   860.
##  4 12349       2010-10-28                154                 2  2221.
##  5 12350       2011-02-02                 57                 1   294.
##  6 12351       2010-11-29                122                 1   301.
##  7 12352       2011-03-22                  9                 6   985.
##  8 12353       2010-10-27                155                 1   318.
##  9 12355       2010-05-21                314                 1   488.
## 10 12356       2011-01-18                 72                 4  5069.
## # … with 4,691 more rows, and 4 more variables: recency_score <int>,
## #   frequency_score <int>, monetary_score <int>, rfm_score <dbl>

3.1 Visualise RFM Data

As we explored earlier, the rfm package provides a number of inbuilt descriptive visualisations.

First we look at the count of customers at each order count:

customer_rfmdata %>% rfm_order_dist()

We also have a few summary plots - showing the histograms of the recency, frequency and monetary measures.

rfm_plot <- customer_rfmdata %>% rfm_histograms(print_plot = FALSE)

rfm_plot +
  scale_x_continuous(labels = label_comma())

Finally, we look at each of the three bivariate plots to explore the relationship between the three quantities.

customer_rfmdata %>%
  rfm_rm_plot(print_plot = FALSE) +
    scale_x_log10(labels = label_comma()) +
    scale_y_log10(labels = label_comma())

customer_rfmdata %>%
  rfm_rf_plot()

customer_rfmdata %>%
  rfm_fm_plot(print_plot = FALSE) +
    scale_x_log10(labels = label_comma()) +
    scale_y_log10(labels = label_comma())

3.2 Assign Customer Segments

customer_segments_tbl <- customer_rfmdata %>%
  rfm_segment(
    segment_names   = segment_names,
    recency_lower   = recency_lower,
    recency_upper   = recency_upper,
    frequency_lower = frequency_lower,
    frequency_upper = frequency_upper,
    monetary_lower  = monetary_lower,
    monetary_upper  = monetary_upper
    )

customer_segments_tbl %>% glimpse()
## Rows: 4,701
## Columns: 9
## $ customer_id       <chr> "12346", "12347", "12348", "12349", "12350", "12351"…
## $ segment           <chr> "Champions", "Loyal Customers", "Loyal Customers", "…
## $ rfm_score         <dbl> 455, 434, 433, 224, 412, 312, 543, 212, 112, 445, 31…
## $ transaction_count <dbl> 8, 3, 3, 2, 1, 1, 6, 1, 1, 4, 1, 3, 8, 3, 4, 1, 1, 1…
## $ recency_days      <dbl> 72, 64, 65, 154, 57, 122, 9, 155, 314, 72, 135, 122,…
## $ amount            <dbl> 77556.46, 1798.71, 859.80, 2221.14, 294.40, 300.93, …
## $ recency_score     <int> 4, 4, 4, 2, 4, 3, 5, 2, 1, 4, 3, 3, 4, 3, 4, 4, 4, 1…
## $ frequency_score   <int> 5, 3, 3, 2, 1, 1, 4, 1, 1, 4, 1, 3, 5, 3, 4, 1, 1, 1…
## $ monetary_score    <int> 5, 4, 3, 4, 2, 2, 3, 2, 2, 5, 5, 5, 5, 4, 2, 2, 2, 3…

4 R Environment

sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.4 (2021-02-15)
##  os       Ubuntu 20.04.2 LTS          
##  system   x86_64, linux-gnu           
##  ui       RStudio                     
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Etc/UTC                     
##  date     2021-07-31                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  arules      * 1.6-7   2021-03-16 [1] RSPM (R 4.0.4)
##  arulesViz   * 1.4-0   2021-03-07 [1] RSPM (R 4.0.3)
##  assertthat    0.2.1   2019-03-21 [1] RSPM (R 4.0.3)
##  backports     1.2.1   2020-12-09 [1] RSPM (R 4.0.3)
##  bookdown      0.21    2020-10-13 [1] RSPM (R 4.0.2)
##  broom         0.7.5   2021-02-19 [1] RSPM (R 4.0.3)
##  bslib         0.2.4   2021-01-25 [1] RSPM (R 4.0.3)
##  cachem        1.0.4   2021-02-13 [1] RSPM (R 4.0.3)
##  cellranger    1.1.0   2016-07-27 [1] RSPM (R 4.0.3)
##  cli           2.3.1   2021-02-23 [1] RSPM (R 4.0.3)
##  codetools     0.2-18  2020-11-04 [2] CRAN (R 4.0.4)
##  colorspace    2.0-0   2020-11-11 [1] RSPM (R 4.0.3)
##  conflicted  * 1.0.4   2019-06-21 [1] RSPM (R 4.0.0)
##  cowplot     * 1.1.1   2020-12-30 [1] RSPM (R 4.0.3)
##  crayon        1.4.1   2021-02-08 [1] RSPM (R 4.0.3)
##  crosstalk     1.1.1   2021-01-12 [1] RSPM (R 4.0.3)
##  DBI           1.1.1   2021-01-15 [1] RSPM (R 4.0.3)
##  dbplyr        2.1.0   2021-02-03 [1] RSPM (R 4.0.3)
##  digest        0.6.27  2020-10-24 [1] RSPM (R 4.0.3)
##  dplyr       * 1.0.5   2021-03-05 [1] RSPM (R 4.0.3)
##  DT          * 0.17    2021-01-06 [1] RSPM (R 4.0.3)
##  ellipsis      0.3.1   2020-05-15 [1] RSPM (R 4.0.3)
##  evaluate      0.14    2019-05-28 [1] RSPM (R 4.0.3)
##  fansi         0.4.2   2021-01-15 [1] RSPM (R 4.0.3)
##  farver        2.1.0   2021-02-28 [1] RSPM (R 4.0.3)
##  fastmap       1.1.0   2021-01-25 [1] RSPM (R 4.0.3)
##  forcats     * 0.5.1   2021-01-27 [1] RSPM (R 4.0.3)
##  foreach       1.5.1   2020-10-15 [1] RSPM (R 4.0.3)
##  fs            1.5.0   2020-07-31 [1] RSPM (R 4.0.3)
##  furrr       * 0.2.2   2021-01-29 [1] RSPM (R 4.0.3)
##  future      * 1.21.0  2020-12-10 [1] RSPM (R 4.0.3)
##  generics      0.1.0   2020-10-31 [1] RSPM (R 4.0.3)
##  ggplot2     * 3.3.3   2020-12-30 [1] RSPM (R 4.0.3)
##  globals       0.14.0  2020-11-22 [1] RSPM (R 4.0.3)
##  glue        * 1.4.2   2020-08-27 [1] RSPM (R 4.0.3)
##  gtable        0.3.0   2019-03-25 [1] RSPM (R 4.0.3)
##  haven         2.3.1   2020-06-01 [1] RSPM (R 4.0.3)
##  highr         0.8     2019-03-20 [1] RSPM (R 4.0.3)
##  hms           1.0.0   2021-01-13 [1] RSPM (R 4.0.3)
##  htmltools     0.5.1.1 2021-01-22 [1] RSPM (R 4.0.3)
##  htmlwidgets   1.5.3   2020-12-10 [1] RSPM (R 4.0.3)
##  httr          1.4.2   2020-07-20 [1] RSPM (R 4.0.3)
##  igraph        1.2.6   2020-10-06 [1] RSPM (R 4.0.3)
##  iterators     1.0.13  2020-10-15 [1] RSPM (R 4.0.3)
##  jquerylib     0.1.3   2020-12-17 [1] RSPM (R 4.0.3)
##  jsonlite      1.7.2   2020-12-09 [1] RSPM (R 4.0.3)
##  knitr         1.31    2021-01-27 [1] RSPM (R 4.0.3)
##  labeling      0.4.2   2020-10-20 [1] RSPM (R 4.0.3)
##  lattice       0.20-41 2020-04-02 [2] CRAN (R 4.0.4)
##  lifecycle     1.0.0   2021-02-15 [1] RSPM (R 4.0.3)
##  listenv       0.8.0   2019-12-05 [1] RSPM (R 4.0.3)
##  lubridate     1.7.10  2021-02-26 [1] RSPM (R 4.0.3)
##  magrittr    * 2.0.1   2020-11-17 [1] RSPM (R 4.0.3)
##  Matrix      * 1.3-2   2021-01-06 [2] CRAN (R 4.0.4)
##  memoise       2.0.0   2021-01-26 [1] RSPM (R 4.0.3)
##  modelr        0.1.8   2020-05-19 [1] RSPM (R 4.0.3)
##  munsell       0.5.0   2018-06-12 [1] RSPM (R 4.0.3)
##  parallelly    1.24.0  2021-03-14 [1] RSPM (R 4.0.3)
##  pillar        1.5.1   2021-03-05 [1] RSPM (R 4.0.3)
##  pkgconfig     2.0.3   2019-09-22 [1] RSPM (R 4.0.3)
##  purrr       * 0.3.4   2020-04-17 [1] RSPM (R 4.0.3)
##  R6            2.5.0   2020-10-28 [1] RSPM (R 4.0.3)
##  Rcpp          1.0.6   2021-01-15 [1] RSPM (R 4.0.3)
##  readr       * 1.4.0   2020-10-05 [1] RSPM (R 4.0.4)
##  readxl        1.3.1   2019-03-13 [1] RSPM (R 4.0.3)
##  registry      0.5-1   2019-03-05 [1] RSPM (R 4.0.0)
##  reprex        1.0.0   2021-01-27 [1] RSPM (R 4.0.3)
##  rfm         * 0.2.2   2020-07-21 [1] RSPM (R 4.0.2)
##  rlang       * 0.4.10  2020-12-30 [1] RSPM (R 4.0.3)
##  rmarkdown     2.7     2021-02-19 [1] RSPM (R 4.0.3)
##  rmdformats    1.0.1   2021-01-13 [1] RSPM (R 4.0.3)
##  rstudioapi    0.13    2020-11-12 [1] RSPM (R 4.0.3)
##  rvest         1.0.0   2021-03-09 [1] RSPM (R 4.0.3)
##  sass          0.3.1   2021-01-24 [1] RSPM (R 4.0.3)
##  scales      * 1.1.1   2020-05-11 [1] RSPM (R 4.0.3)
##  seriation     1.2-9   2020-10-01 [1] RSPM (R 4.0.2)
##  sessioninfo   1.1.1   2018-11-05 [1] RSPM (R 4.0.3)
##  stringi       1.5.3   2020-09-09 [1] RSPM (R 4.0.3)
##  stringr     * 1.4.0   2019-02-10 [1] RSPM (R 4.0.3)
##  tibble      * 3.1.0   2021-02-25 [1] RSPM (R 4.0.3)
##  tidygraph   * 1.2.0   2020-05-12 [1] RSPM (R 4.0.3)
##  tidyr       * 1.1.3   2021-03-03 [1] RSPM (R 4.0.4)
##  tidyselect    1.1.0   2020-05-11 [1] RSPM (R 4.0.3)
##  tidyverse   * 1.3.0   2019-11-21 [1] RSPM (R 4.0.3)
##  TSP           1.1-10  2020-04-17 [1] RSPM (R 4.0.0)
##  utf8          1.2.1   2021-03-12 [1] RSPM (R 4.0.3)
##  vctrs         0.3.7   2021-03-29 [1] RSPM (R 4.0.4)
##  visNetwork    2.0.9   2019-12-06 [1] RSPM (R 4.0.3)
##  withr         2.4.1   2021-01-26 [1] RSPM (R 4.0.3)
##  xfun          0.22    2021-03-11 [1] RSPM (R 4.0.3)
##  xml2          1.3.2   2020-04-23 [1] RSPM (R 4.0.3)
##  yaml          2.2.1   2020-02-01 [1] RSPM (R 4.0.3)
## 
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library